DNA microarrays are devices that are able, in principle, to detect and quantify the presence 
of specific nucleic acid sequences in complex biological mixtures. The measurement consists in 
detecting fluorescence signals from several spots on the microarray surface onto which different 
probe sequences are grafted. One of the problems of the data analysis is that the signal contains 
a noisy background component due to non-specific binding. This paper presents a physical model 
for background estimation in Affymetrix Genechips. It combines two different approaches. The 
first is based on the sequence composition, specifically its sequence dependent hybridization affinity. 
The second is based on the strong correlation of intensities from locations which are the physical 
neighbors of a specific spot on the chip. Both effects are incorporated in a background functional 
which contains 24 free parameters, fixed by minimization on a training data set. In all data analyzed 
the sequence specific parameters, obtained by minimization, are found to strongly correlate with 
empirically determined stacking free energies for RNA/DNA hybridization in solution. Moreover, 
there is an overall agreement with experimental background data and we show that the physics-based 
model proposed in this paper performs on average better than purely statistical approaches for 
background calculations. The model thus provides an interesting alternative method for background 
subtraction schemes in Affymetrix Genechips. 

PACS numbers: 87.15.-v, 82.39.Pj 



I. INTRODUCTION 

DNA microarrays have become a powerful tool to mon- 
itor the gene expression level of thousands of genes si- 
multaneously on a genome-wide scale (for a recent re- 
view see for instance Ref. p|). Microarrays are based 
on the hybridization between the surface-bound DNA 
sequences (called probes) and DNA or RNA sequences 
in solution (called targets). The probes are designed to 
have a sequence exactly complementary to that of the 
desired target sequence one wishes to detect in solution. 
As the target molecules in solution are labelled with flu- 
orescent markers, the amount of hybridized targets can 
be determined by means of optical measurements. The 
fluorescence intensity measured at a specific spot on the 
microarray reflects the concentration of complementary 
targets in the used sample solution. 

One of the most prominent commercial platforms of DNA microarrays is provided
by Affymetrix. By virtue of in-situ photolithographic techniques, Affymetrix
produces arrays in which more than one million different probes are grafted on
a single chip. The probes are 25-nucleotide-long sequences of single-stranded
DNA. As a single 25-mer may not provide reliable measurements of the expression
level of one specific gene, Affymetrix chooses 10-16 fragments from different
regions of each gene, which together form a so-called probe set. Each probe set
is meant to uniquely characterize a given gene.

One of the problems of the data analysis is that the measured fluorescence
signal does not only contain information about the concentration of a specific
gene in solution, but also contributions from hybridization with fragments
which only partially overlap with the surface-bound sequence. Thus, the
measured fluorescence of a given probe site can be written as

$$I = I_0 + I_{sp}(c) \qquad (1)$$

where $I_{sp}(c)$ is the specific contribution to the signal, which depends on
the concentration c of the complementary target in solution, and $I_0$ is a
background signal. The aim of this work is to introduce a new model, based upon
inputs from physical chemistry, for the calculation of $I_0$ for Affymetrix
arrays. Identifying the main sources of background intensity is crucial in
order to make accurate and reliable estimates of gene expression levels,
especially for weakly expressed genes, for which $I_{sp}(c) \ll I_0$.

A peculiarity of Affymetrix Genechips is that probes come in pairs: a probe,
the so-called perfect match (PM), has a sequence exactly complementary to the
target sequence in solution. A second probe, physically located next to the PM
on the chip, has a single non-complementary base with respect to the specific
target. The latter is known as mismatch (MM). Originally, MMs were supposed to
estimate only the non-specific hybridization, i.e. it was expected that
$I_{MM} \approx I_0$, so that from Eq. (1) one could have estimated
$I_{sp}(c) = I_{PM} - I_{MM}$. However, this approach experiences some
difficulties, as in some chips as many as 30% of the MM intensities are higher
than the corresponding PMs [1] (although this seems to occur predominantly in
the low-intensity regime, where both PM and MM signals may be dominated by
non-specific hybridization). Moreover, it has been found that $I_{MM}$ also
depends on the concentration in solution of the almost complementary target
sequence. Hence the background adjustment based on the difference
$I_{PM} - I_{MM}$ currently does not find much consensus, and other strategies
have been devised. For a
discussion of MM hybridization see also Refs. [al3, Si- 



Due to its central importance, the modeling of background intensities is not
new. One can distinguish between models using a purely statistical treatment
and others in which physical inputs coming from equilibrium thermodynamics
were employed. A more extensive discussion of previous studies in relation
with our results is postponed to the final section of this paper.

In this paper we present a new method to estimate the background noise of
Affymetrix gene expression arrays. We construct a functional which contains 24
parameters, fixed by minimization on a set of training data. The functional
takes into account the physical chemistry of hybridization through a subset of
the 24 parameters. These parameters depend on sequence composition and are
equivalent to the stacking free energies in the nearest-neighbor model. We
also exploit the observation that the background signal of a given site
strongly correlates with the intensities measured on neighboring sites. The
accuracy of the results is tested on a set of spike-in data in which
transcripts are added in solution at known concentration. In particular, being
interested in the accuracy of our background predictions, we focus on the data
at zero concentration. The model developed in this paper reproduces the
spike-in data very well, and in this particular case it performs better than
other popular algorithms used for background adjustment in Affymetrix
expression chips.

This paper is organized as follows: the background functional is introduced in
Sec. II. The results of the minimization are given in Sec. III, where they are
tested on the spike-in data set and compared with the predictions of other
algorithms. Finally, in Sec. IV we present a general discussion of the results
obtained and provide some general conclusions.

II. MODEL

Our approach to estimate the background intensity is twofold. First, we make
use of the property of Affymetrix microarrays that neighboring probes have
similar sequences, and hence also similar affinities for non-specific binding.
We recall that a fluorescence image from an Affymetrix chip is contained in a
file giving the (x, y) coordinate of each probe and the corresponding measured
intensity. By design a PM probe is located at (x, y), with odd y, and the
corresponding MM probe is located at (x, y + 1). Hence the chip is arranged in
rows of PM and MM sequences, as shown in Fig. 1. PM and MM pair probes share
all nucleotides but the middle (13th) one. Hence there is a strong sequence
correlation between the rows with odd y and the rows at y + 1. But the
sequences of neighbors along the x-direction are also correlated, as part of
the microarray design.

FIG. 1: Schematic view of the two main ingredients for the background
functional developed in this paper. (Left) Background intensity is correlated
to the fluorescence signal measured from neighboring spots. (Right) The
background depends also on the sequence dinucleotide composition and on the
relative distance of the dinucleotides from the surface, i.e. inhomogeneities
are taken into account.



The main idea of our approach is to use MM intensities as background estimates
for the PM signals only for genes that are sufficiently weakly expressed, i.e.
for which both PM and MM signals are low on the global scale of intensities in
Affymetrix chips. An estimator is built and optimized on these low-intensity
data (to be defined more precisely later), and can then be applied to the
whole chip, thus also in the high-intensity regime. Let us consider an MM at
position (x, y). Because of correlations with the neighboring sequences, its
intensity will be correlated with the intensities of the neighboring sites on
the chip. In particular, we consider the weighted average of the intensities
of the two neighboring MMs at positions (x ± 1, y), of the corresponding PM at
(x, y − 1) and of the two PMs at positions (x ± 1, y − 1), as shown in Fig. 1.
Differences in sequence tend to cause Gaussian fluctuations in the effective
affinities, hence in the logarithm of the intensities,
$\eta(x, y) = \ln I(x, y)$, rather than in the intensities themselves. The
local dependence of the background functional thus takes the form

$$\eta_{local}(x,y) = p_0 + p_1\,\eta(x+1,y) + p_2\,\eta(x-1,y) + p_3\,\eta(x+1,y-1) + p_4\,\eta(x-1,y-1) + p_5\,\eta(x,y-1), \qquad (2)$$

in which $p_i$, $i = 0, \ldots, 5$, are weight factors, constrained by
$\sum_{i=0}^{5} p_i = 1$.
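As a minimal sketch of Eq. (2), the local estimate can be evaluated from a table of log-intensities. The function name `eta_local`, the dictionary layout and the toy intensities and weights below are illustrative assumptions, not values from the analysis.

```python
import math

def eta_local(log_I, x, y, p):
    """Evaluate Eq. (2) for the MM probe at (x, y).

    log_I maps (x, y) -> ln I; p = (p0, p1, ..., p5) are the weights.
    """
    p0, p1, p2, p3, p4, p5 = p
    return (p0
            + p1 * log_I[(x + 1, y)]        # MM neighbor at x + 1
            + p2 * log_I[(x - 1, y)]        # MM neighbor at x - 1
            + p3 * log_I[(x + 1, y - 1)]    # diagonal PM neighbor
            + p4 * log_I[(x - 1, y - 1)]    # diagonal PM neighbor
            + p5 * log_I[(x, y - 1)])       # PM partner of this MM

# Toy 3x2 patch: PM row at y = 1, MM row at y = 2 (hypothetical intensities).
intensities = {(0, 1): 120.0, (1, 1): 150.0, (2, 1): 110.0,
               (0, 2): 100.0, (1, 2): 130.0, (2, 2): 90.0}
log_I = {xy: math.log(v) for xy, v in intensities.items()}
weights = (0.0, 0.2, 0.2, 0.1, 0.1, 0.4)   # hypothetical weights
eta = eta_local(log_I, 1, 2, weights)
```

The probe's own intensity at (x, y) never enters the estimate; only its five neighbors do.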

A completely different indicator of the background intensity is based purely
on the probe sequence. A well-known model to estimate the affinity between a
DNA probe and its complementary RNA target is the nearest-neighbor model [18].
Here, the affinity is given by a sum over pairs of neighboring nucleotides, in
which each term can take 16 different values, depending on whether the pair is
AA, AC, AG, AT, ..., GG. We expect that the background signal is due to the
binding to the probe of short fragments of sequences from other genes, which
are complementary to the probe over some fraction of its total length. We
introduce 16 pair strengths $P_{\alpha\beta}$ (with
$\alpha, \beta \in \{A, T, C, G\}$) as fitting parameters.

To approximately incorporate the effects of "unzipping" of the DNA/RNA hybrid
at its top and bottom, we add a parabolic weighting as a function of the
position along the probe, centered on the middle of the probe at
$k_m = 12\tfrac{1}{2}$. In the fabrication of the chip the majority of probes
do not reach their full length of 25 nucleotides. The effect of this length
variation is modeled by a linear term in the weighting function as well. In
total, this yields

$$\eta_{seq}(s) = \sum_{k=1}^{24} \sum_{\alpha,\beta} \delta^{k}_{\alpha\beta}(s)\, P_{\alpha\beta} \left[ 1 + (k - k_m)\, p_l + (k - k_m)^2\, p_p \right] \qquad (3)$$

with $\alpha, \beta \in \{A, T, C, G\}$ and where

$$\delta^{k}_{\alpha\beta}(s) = \begin{cases} 1 & \text{if } s_k = \alpha \text{ and } s_{k+1} = \beta \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

Here $s_k$ indicates the k-th nucleotide of the sequence s(x, y), which has a
total length of 25 letters. Summing over all possible letters $\alpha, \beta$
is equivalent to counting the frequency of each pair $\alpha\beta$ within a
given sequence s(x, y). The 16 parameters $P_{\alpha\beta}$ reflect the
influence of each pair $\alpha\beta$ on the background intensity. According to
the nearest-neighbor model, such pair parameters can be used to describe the
formation of RNA/DNA hybrid duplexes [19]. Also here, we expect that our
approximations lead to a more or less Gaussian spread in the effective
affinities. The sequence-based estimate of the background intensity is then
given by $I_{seq}(x, y) = \exp(\eta_{seq}(s))$.
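The sequence term can be sketched in a few lines, assuming the probe is given as a 25-letter string indexed k = 1, ..., 25 in the same orientation as the fit. The names `eta_seq`, `p_l`, `p_p` and the flat parameter table are hypothetical and serve only to illustrate Eq. (3).

```python
import math
from itertools import product

K_M = 12.5  # midpoint of the 24 dinucleotide positions k = 1..24

def eta_seq(seq, P, p_l, p_p):
    """Evaluate Eq. (3) for a probe sequence given as a string."""
    total = 0.0
    for k in range(1, len(seq)):                 # k = 1, ..., 24 for a 25-mer
        pair = seq[k - 1] + seq[k]               # letters s_k and s_{k+1}
        weight = 1.0 + (k - K_M) * p_l + (k - K_M) ** 2 * p_p
        total += P[pair] * weight
    return total

# Hypothetical parameters: every pair strength equal to 1.
P_flat = {a + b: 1.0 for a, b in product("ACGT", repeat=2)}
probe = "ACGTACGTACGTACGTACGTACGTA"              # arbitrary 25-mer
I_seq = math.exp(eta_seq(probe, P_flat, 0.0, 0.0))
```

With equal pair strengths the linear term cancels, because the offsets k − k_m sum to zero over k = 1, ..., 24; only the parabolic term changes the total.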

We then combine the two different estimates for the background affinity with
arbitrary weights:

$$\ln I(x, y; s) = \eta(x, y; s) = \eta_{local}(x, y) + \eta_{seq}(s) \qquad (5)$$

where the relative weight of the first estimate is absorbed in the parameters
$p_i$ (we no longer enforce the restriction $\sum_{i=0}^{5} p_i = 1$) and the
relative weight of the second estimate in the parameters $P_{\alpha\beta}$.

We proceed by constructing a cost function whose minimization yields estimates
of the 24 parameters in Eq. (5). We write the cost function as an average over
all selected probes of the squared difference between the actual background
log-intensity and the prediction $\eta(x, y; s')$:

$$S = \frac{1}{N} \sum_{s'(x,y)} \left[ \log I_{MM} - \eta(x, y; s') \right]^2. \qquad (6)$$

Here, $s'(x, y)$ runs over a subset of N sequences which includes only those
MM probes whose corresponding PM intensities are below a certain threshold
$\tilde{I}$ (in Affymetrix units) and which themselves do not exceed
$\tilde{I}$, so as to exclude bright MMs from the analysis.

Equation (6) incorporates Affymetrix' original idea of using MM intensities as
background measures. Strict selection rules need to be imposed on the input
data $s'(x, y)$ to ensure that only those experimentally obtained values of
$I_{MM}$ are used which can be clearly attributed to background noise, and not
to hybridization with the target complementary to the corresponding PM probe.
But how do we find a criterion which identifies and filters the undesired
probes? Fortunately, the spike-in data at concentration c = 0 (for details,
see Sec. III A) can be used as a reference for background noise. By comparing
the $I_{MM}$ histograms of the input and spike-in data (c = 0), a threshold
intensity $\tilde{I}$ can be found such that both histograms are strongly
correlated. In the present work, the threshold intensity is set to
$\tilde{I} = 350$, resulting in a discard of 25-30% of the data. For
comparison, saturated probes have intensities around 12,000.
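The selection rule and the cost function can be sketched as follows. The record layout (`I_PM`, `I_MM` keys), the function names and the toy intensities are assumptions made for illustration, and the natural logarithm is assumed for the log-intensities.

```python
import math

def select_training(probes, threshold=350.0):
    """Keep MM probes whose PM partner is below the threshold and which
    themselves do not exceed it (intensities in Affymetrix units)."""
    return [p for p in probes
            if p["I_PM"] < threshold and p["I_MM"] <= threshold]

def cost(training, predict):
    """Mean squared difference between log I_MM and the prediction."""
    return sum((math.log(p["I_MM"]) - predict(p)) ** 2
               for p in training) / len(training)

probes = [
    {"I_PM": 120.0, "I_MM": 100.0},
    {"I_PM": 900.0, "I_MM": 200.0},   # PM too bright: discarded
    {"I_PM": 300.0, "I_MM": 800.0},   # bright MM: discarded
    {"I_PM": 250.0, "I_MM": 240.0},
]
training = select_training(probes)
baseline = cost(training, lambda p: math.log(150.0))  # constant predictor
```

In an actual fit, `predict` would evaluate the combined functional for the probe's position and sequence; the constant predictor here only provides a reference value.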

The optimization algorithm used to minimize the cost function of Eq. (6) is
steepest descent with damped Newtonian dynamics, in which Eq. (6) is
interpreted as a potential energy. The value of the cost function obtained
from the minimization procedure can be used as a measure of the quality of the
attained minimum.
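One possible reading of "damped Newtonian dynamics" is a particle moving in the cost landscape under the force given by the negative gradient, with a friction term that removes kinetic energy until it comes to rest in a minimum. The sketch below assumes that reading; the step size, friction coefficient, stopping rule and the toy quadratic potential are illustrative choices, not the settings used in the analysis.

```python
def minimize_damped(grad, x0, dt=0.05, damping=1.0, max_steps=20000, tol=1e-12):
    """Relax x'' = -grad(x) - damping * x' until forces and velocities vanish."""
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(max_steps):
        g = grad(x)
        # Stop only when both the force and the velocity are negligible.
        if sum(gi * gi for gi in g) + sum(vi * vi for vi in v) < tol:
            break
        v = [(1.0 - damping * dt) * vi - dt * gi for vi, gi in zip(v, g)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy quadratic "potential" f(a, b) = (a - 1)^2 + 2 (b + 3)^2.
def grad_f(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)]

minimum = minimize_damped(grad_f, [0.0, 0.0])
```

For the actual functional, `grad` would return the derivatives of the cost with respect to all 24 parameters.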
III. RESULTS 
A. Experimental data 

In the present work, we analyze data from Affymetrix microarray experiments
which are publicly available [20, 21]. The results (scanned intensities) of
each experiment are saved in a so-called "CEL-file" (extension .CEL). For each
probe, the CEL-file contains information about its physical location on the
chip (x- and y-coordinate) and the mean intensity. The CEL-file does not
contain any information about the probe set name or sequence. For further
processing the so-called CDF-file (chip description file) is needed. Each
CEL-file is associated with a CDF-file, which allows one to retrieve the
information necessary to map each probe to its corresponding probe set. The
sequence information can be found in the probe-tag-file, which contains the
name of each probe, its location, an Affymetrix-specific probe interrogation
position, the sequence and the target strandedness. The latter file is
particularly useful if one wishes to investigate the sequence dependence of
the measured intensities.

Affymetrix offers a large palette of gene expression arrays for different
organisms. In this work, we focus on the analysis of two human genome
chipsets, namely HGU95A and HGU133A, and two non-human organisms, the African
clawed frog (Xenopus laevis) and the zebrafish (Danio rerio). All four
chipsets are used, in a first step, to investigate and validate the
correlation between well-known hybridization stacking energies and the 16
parameters of Eq. (3) (see Sec. III C). As a second step, we focus our
attention on a subset of the HGU95A and HGU133A datasets, the so-called Latin
Square experiments [20]. Those experiments serve as calibration experiments,
as some target sequences are added at controlled concentrations ("spiked-in")
to a background reference solution. The target concentrations range from 0 pM
to 1024 pM. Since the spike-in experiments at zero concentration measure pure
background, we use them as a benchmark for our background functional, Eq. (5).

B. Neighbor-dependent parameters 

As discussed in Sec. II, the intensities of neighboring probes can be used to
estimate the non-specific binding of a given probe, because of the design of
Affymetrix microarrays. However, Eq. (2) takes only five neighbors into
account, although each spot on the array is surrounded by eight neighbors in
total: four direct and four diagonal neighbors. Eq. (2) originally included
eight parameters, but it turned out that the correlations of the background
intensity with the "top" neighbors in row y + 1 (see Fig. 1) are much smaller
than those with the row y − 1. The analysis of the correlation between
sequences which are neighbors on the array explains why the (y + 1) neighbors
are less useful.



For the HGU133A array all four nucleotides are roughly equally present, i.e.
the A, C, G, T densities are 0.239, 0.248, 0.243 and 0.269. A consequence is
that for two randomly chosen nucleotides the probability of finding the same
letter is 25.05%. However, the probability of finding the same letter at the
k-th position at sites (x, y) and (x + 1, y) (or equally at (x − 1, y)) is
48.29%. This probability increases to 96% when considering the neighbors
(x, y) and (x, y − 1), for even y, which is not surprising as the probes at
these locations are a PM/MM pair, which share 24 out of 25 nucleotides. The
probability of finding the identical nucleotide at (x, y + 1) for even y is
37.58%. We thus see that the sequence correlation along the x-direction
clearly exceeds the correlation along the y-direction, except when considering
corresponding PM and MM probes. Because of this, the three "top" neighbors
were not considered any further, in order to restrict the computational effort
to a minimum (see Fig. 1).
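The two reference numbers in this argument are easy to check. The sketch below computes the chance coincidence from the rounded densities quoted above, which gives about 25.0% (the quoted 25.05% was presumably obtained from unrounded densities), and the positional identity of a PM/MM pair. The function names and the example 25-mer are illustrative assumptions.

```python
def match_probability(freqs):
    """Probability that two independently drawn letters coincide."""
    return sum(f * f for f in freqs.values())

def positional_identity(seq_a, seq_b):
    """Fraction of positions k at which two equal-length probes agree."""
    assert len(seq_a) == len(seq_b)
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)

densities = {"A": 0.239, "C": 0.248, "G": 0.243, "T": 0.269}  # rounded values from the text
p_random = match_probability(densities)

pm = "ACGTACGTACGTACGTACGTACGTA"      # arbitrary 25-mer standing in for a PM probe
mm = pm[:12] + "T" + pm[13:]          # middle (13th) base A replaced by T
p_pair = positional_identity(pm, mm)  # 24 of 25 positions agree
```

Applying `positional_identity` to all pairs of neighboring probes on a chip and averaging would give the chip-wide percentages discussed above.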

The functional minimization shows another interesting pattern: the three
closest neighbor parameters $p_1$, $p_2$ and $p_5$ are positively correlated
with the background signal, while the diagonal neighbors $p_3$ and $p_4$ show
negative correlations (the correspondence between the $p_i$ and the positions
can be deduced from Eq. (2)). This result (i.e. $p_1, p_2, p_5 > 0$ and
$p_3, p_4 < 0$) is found in all chips analyzed. Typical average outputs on
human genome chips of the Latin Square experiments are

$$\{p_1, \ldots, p_5\} \approx \{0.06, 0.08, -0.04, -0.03, 0.35\}. \qquad (7)$$

The interpretation is as follows: the MM signal is most strongly correlated
with its corresponding PM, as reflected by the magnitude of $p_5$. The
sequence at (x, y) is closely correlated with the two MM neighbors (x ± 1, y)
(parameters $p_1$ and $p_2$), i.e. a strong background intensity at
(x ± 1, y) corresponds to a strong background at (x, y). However, a strong
signal of the MM probes at (x ± 1, y) may also be caused by the presence of
complementary target molecules at high concentrations in solution. The
functional corrects for this with negative coefficients for the signals at
positions (x ± 1, y − 1) (parameters $p_3$ and $p_4$).

C. Sequence dependent parameters 

The nearest-neighbor model is widely used to describe the thermodynamics of
duplex formation of nucleic acids in solution, as it yields good
approximations of the sequence dependence of duplex stability. It is based on
the assumption that the stability of each base pair depends on the identity
and orientation of the adjacent base pairs. For a given sequence s of N
nucleotides the hybridization free energy is given by:

$$\Delta G = \sum_{k=1}^{N-1} \sum_{\alpha,\beta} \delta^{k}_{\alpha\beta}(s)\, \Delta G_{\alpha\beta} + \Delta G_{init}(s_1, s_N), \qquad (8)$$

where $\Delta G_{\alpha\beta}$ are the stacking free energies associated with
a pair of nucleotides $\alpha\beta$; $\delta^{k}_{\alpha\beta}(s)$ counts the
occurrences of the pairs $\alpha\beta$ along the sequence and was defined in
Eq. (4). In Eq. (8) we have added a term which depends on the end nucleotides
$s_1$ and $s_N$, referred to as the helix initiation parameter
$\Delta G_{init}$.
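Eq. (8) amounts to a table lookup along the sequence. The following sketch uses placeholder numbers only: the stacking table and the initiation term are hypothetical stand-ins, not the experimentally determined parameters, and the function name is an assumption.

```python
def duplex_free_energy(seq, dG_stack, dG_init):
    """Sum of stacking terms over adjacent pairs plus an initiation term."""
    stacking = sum(dG_stack[seq[k] + seq[k + 1]] for k in range(len(seq) - 1))
    return stacking + dG_init(seq[0], seq[-1])

# Placeholder numbers, NOT measured values: every stack contributes -1.0 kcal/mol
# and initiation costs +3.0 kcal/mol regardless of the end nucleotides.
dG_placeholder = {a + b: -1.0 for a in "ACGT" for b in "ACGT"}
dG_total = duplex_free_energy("ACGTACGTACGTACGTACGTACGTA",
                              dG_placeholder, lambda first, last: 3.0)
```

For a 25-mer there are 24 stacking terms, so the placeholder table gives −24 + 3 = −21 kcal/mol.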

The parameters $\Delta H_{\alpha\beta}$ and $\Delta S_{\alpha\beta}$, from
which one obtains $\Delta G = \Delta H - T \Delta S$, are known from
hybridization experiments in solution. Due to symmetry considerations, there
are only 10 independent $\Delta G_{\alpha\beta}$ in the case of DNA/DNA
duplexes (see Table 2 of [23]). There are no such symmetries in RNA/DNA
duplexes, hence there are in total 16 parameters, which were determined
experimentally by Sugimoto et al. [19]. Even though the nearest-neighbor model
was originally developed to calculate duplex free energies in solution, it
provides reasonable approximations to the energetics involved in the
hybridization processes on Affymetrix microarrays. A recent experimental study
on a class of spotted arrays, in which the hybridization of perfect matching
and multiple mismatching probes was analyzed, showed that the data are well
described by nearest-neighbor parameters for duplex formation in solution.

Our approach to model the background intensity involves the determination of
24 parameters, of which the 16 parameters $P_{\alpha\beta}$ reflect the
influence of each pair $\alpha\beta$ on the background intensity. The
relationship between the 16 parameters $P_{\alpha\beta}$ and the 16 stacking
parameters $\Delta G_{\alpha\beta}$ is quickly derived. According to the
Langmuir model the measured intensity I at a given site is related to the
hybridization free energy $\Delta G$ via



$$I \propto e^{-\Delta G / RT}. \qquad (9)$$



Recalling that the sequence dependent functional $\eta_{seq}$ given in
Eq. (3) is fitted to the logarithm of the intensity (see Eq. (6)), one expects
that the parameters $P_{\alpha\beta}$ are linearly related to the stacking
free energy parameters $\Delta G_{\alpha\beta}$. To verify to what extent this
linear relationship holds, all 16 parameters $P_{\alpha\beta}$ are calculated
for each CEL-file, i.e. each chip, by the minimization of the cost function
Eq. (6). Then, each $P_{\alpha\beta}$ is averaged over all available CEL-files
of a given chipset and plotted as a function of the $\Delta G_{\alpha\beta}$
given in Ref. [19]. Two of these plots for the Latin Square set are shown in
Fig. 2. The plots indicate that the linear relationship between
$P_{\alpha\beta}$ and $\Delta G_{\alpha\beta}$ is approximately verified. The
correlation coefficients for the linear fit are typically about 0.83. Table I
reports the correlation coefficients for the H. sapiens, X. laevis and
D. rerio chipsets. The results show that our ansatz to include the
nearest-neighbor model in the background estimation is justified and that the
influence of the pairs is not to be neglected.
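The comparison between fitted pair strengths and stacking free energies reduces to a linear fit and a correlation coefficient over 16 points. A minimal sketch, using four made-up points rather than the fitted or measured values, and hypothetical helper names:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def linear_fit(xs, ys):
    """Least-squares slope and intercept of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical points: stacking free energies (kcal/mol) and fitted pair strengths.
dG = [-0.5, -1.0, -1.5, -2.0]
P = [0.11, 0.18, 0.33, 0.38]
r = pearson_r(dG, P)
slope, intercept = linear_fit(dG, P)
```

In the analysis described above the inputs would be the 16 chipset-averaged pair strengths and the corresponding 16 solution values.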




FIG. 2: Parameters $P_{\alpha\beta}$, as obtained from the minimization of
Eq. (6) on a training data set, plotted as a function of
$\Delta G_{\alpha\beta}$ (in kcal/mol), the nearest-neighbor stacking free
energy obtained from DNA/RNA hybridization in solution. The two panels refer
to (a) the average over 19 experiments of the HGU95A (1521) spike-in data set
and (b) the average over 42 experiments of the HGU133A spike-in data set. The
error bars are the standard deviations. DNA pairs are denoted from the 5' to
the 3' end. The straight lines are linear fits to the points. The correlation
coefficients of the fits for these and for the other experiments analyzed are
given in Table I.



D. Benchmark: Spike In Data 

To test the accuracy of the predicted background signal as given in Eq. (5),
we turn our attention to the spike-in data. Concerning background analysis, we
are naturally most interested in the c = 0 spike-in data, as the measured
signal is then pure background noise. By virtue of Eq. (5) we calculate the
background signal $\eta$ for a given probe set of the Latin Square data for
the chipsets HGU95A and HGU133A.

Fig. 3 is representative of the results for HGU95A and HGU133A. In general, we
find that the predicted background intensity $\eta$ nicely follows the PM/MM
intensities of the spike-in experiments at zero concentration and hence really
describes the shape of the background. One would expect the PM and MM values
at zero concentration to be almost identical; this is mostly the case, as the
median value of the difference $(PM - MM)_{HGU95A} = 28$ for HGU95A shows.
This value is even smaller for the HGU133A chipset, i.e.
$(PM - MM)_{HGU133A} = 21$. Exceptions where either the PM or the MM intensity
clearly exceeds the median difference suggest the presence of transcript
fragments which are complementary to the probe over a length of more
nucleotides than one would statistically expect when considering background
issues. Especially the origin of bright MMs has been investigated intensively
in the recent past (see e.g. [1]).

TABLE I: Correlation coefficients of the linear fits of the $P_{\alpha\beta}$
parameters obtained from minimization of the background functional and the
hybridization free energies $\Delta G_{\alpha\beta}$ for RNA/DNA duplex
formation in aqueous solution taken from Ref. [19]. The data are for human
chipsets (HGU95A sets, HGU133A) of the Affymetrix Latin Square experiment and
for Xenopus laevis (XL) and Danio rerio (DR) arrays.

Chipset           # of CEL-files   Corr. coeff.
HGU95A (1521)     19               0.870
HGU95A (1532)     19               0.869
HGU95A (2353)     19               0.861
HGU133A           42               0.791
XL (GSE 3334)     6                0.805
XL (GSE 3368)     20               0.806
XL (GSE 4448)     31               0.792
DR (GSE 5048)     6                0.847



E. Comparison to other approaches 

Figure 4 compares the performance of our background functional $\eta$ to three
of the most commonly used algorithms, namely MAS5.0 [15, 16], RMA [17] and
GCRMA [24]. MAS5.0 is a commercial software for data analysis developed by
Affymetrix. For our calculations we used the free version of MAS5.0 available
under the open project Bioconductor. RMA and GCRMA are two variants of the
same type of algorithm, both freely available from Bioconductor.

In order to compare the performance of the background subtraction schemes, we
calculated

$$d = \frac{1}{M} \sum \left[ \log I_{PM} - \log I_b \right]^2, \qquad (10)$$

i.e. the average squared deviation of the predicted background signal $I_b$
from MAS5.0, RMA, GCRMA and from our algorithm with respect to the
experimental background intensity $I_{PM}$. The sum in Eq. (10) runs over all
M probes in the Affymetrix spike-in experiments at concentration c = 0.

FIG. 3: Signal intensities for PM (crosses) and MM (circles) for three probe
sets plotted as a function of the probe number. The data are for three spikes
at c = 0 (zero concentration means that the targets are absent from the
solution). The probe sets are (a) 38734_at (HGU95A - 1521l), (b)
AFFX-r2-TagE_at (HGU133A - Expt4_R1) and (c) 209795_at (HGU133A - Expt13_R1).
The solid line shows the background estimate based on the functional of
Eq. (5).
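The comparison metric is a mean squared deviation of log-intensities. A short sketch, assuming natural logarithms and using made-up intensities; the function name is hypothetical:

```python
import math

def mean_squared_log_deviation(I_pm, I_b):
    """Average of (log I_PM - log I_b)^2 over the M probes."""
    assert len(I_pm) == len(I_b) and len(I_pm) > 0
    return sum((math.log(a) - math.log(b)) ** 2
               for a, b in zip(I_pm, I_b)) / len(I_pm)

measured = [100.0, 150.0, 200.0]              # hypothetical zero-concentration PM values
estimate_good = [105.0, 140.0, 210.0]         # hypothetical background predictions
estimate_low = [v / math.e for v in measured] # underestimates every probe by a factor e

d_good = mean_squared_log_deviation(measured, estimate_good)
d_low = mean_squared_log_deviation(measured, estimate_low)
```

Because the deviation is taken on the log scale, an estimate that is off by a constant factor on every probe contributes the same amount regardless of the probe's brightness; the factor e used here gives a deviation of 1.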

The examples of Fig. 4 show that MAS5.0 underestimates the background values
and hardly deviates from a straight line. MAS5.0 uses the lowest 2% of probe
intensities of each region of a chip to estimate a background value. Each
probe intensity is then background corrected based upon a weighted average of
the background values of the regions. A detailed description can be found in
[15, 16]. The background adjustment method used by RMA [17] uses a global
model for the distribution of probe intensities. It is based on empirical
findings on the distribution of probe intensities and only considers PM values
as input as well as output. However, RMA does not take non-specific binding
into account, which often leads to an underestimation of the background.
GCRMA [24] is based on RMA and includes sequence information to calculate a
so-called affinity measure. The results of GCRMA excel those of RMA and
MAS5.0. However, we have found that in some cases after background subtraction
GCRMA gives a higher value of the intensity compared to the original data,
which signifies a negative background correction. For these points we have set
$\log I_b = 1$ in Eq. (10).

FIG. 4: Examples of comparison of the performance of MAS5.0 (triangles down),
RMA (triangles up) and GCRMA (squares) with the algorithm developed in this
paper (diamonds). The crosses (PM) and circles (MM) are the zero concentration
spike-in data. The data shown are for the probe sets (a) 36202_at (HGU95A -
1521g), (b) 1708_at (HGU95A - 1532b) and (c) 209606_at (HGU133A - Expt10_R1).

Table II reports the value of the mean squared deviation calculated from
Eq. (10). Smaller values of this parameter signify a more accurate algorithm
for the background estimation. The table indeed shows that globally our
physical-chemistry based algorithm, indicated as column $\eta$, outperforms
the three other statistical algorithms. As already anticipated by the graphs
in Fig. 4, the performance of GCRMA is generally far better than that of
MAS5.0 and RMA. Our algorithm improves further on GCRMA in all cases analyzed,
except for the last set (HGU95A experiment 2353) of Table II.



d               η       RMA     MAS5.0  GCRMA
HGU133A         0.161   0.521   1.589   0.194
HGU95A-1521     0.163   0.760   1.127   0.200
HGU95A-1532     0.203   0.698   1.041   0.343
HGU95A-2353     0.099   0.508   0.777   0.088

TABLE II: Average squared deviation d for four human genome chipsets according
to Eq. (10), where $I = I_{PM}$.



IV. DISCUSSION 

We have introduced a new model to predict background intensities in Affymetrix
GeneChips. Our model takes into account the physical chemistry involved in
hybridization as well as the influence of the design of Affymetrix
microarrays. The background functional developed in this paper contains two
terms, given by Eq. (2) and Eq. (3), that reflect these two contributions.

The sequence-based background estimate (Eq. (3)) includes 16 pair-strength
parameters $P_{\alpha\beta}$. Physical-chemistry arguments suggest that these
parameters are correlated to the hybridization free energies
$\Delta G_{\alpha\beta}$ for the corresponding pair of nucleotides. One
expects an approximately linear relationship between the two. The fact that
the parameters $P_{\alpha\beta}$ are indeed linearly correlated to the
hybridization free energies in solution, as shown in Fig. 2, suggests that the
model presented here captures the origin of the background correctly. We
recall that hybridization in Affymetrix expression arrays is between a DNA
strand at the microarray surface and an RNA strand in solution, therefore the
hybridization free energies to compare with are those for RNA/DNA duplexes.
These were determined experimentally by Sugimoto et al. [19]. It is worth
mentioning that a previous study of microarray data analysis using
physical-chemistry inputs, although in a different way than what is developed
here, reported a weaker correlation (r = 0.6) between fitted affinities and
the experimental parameters by Sugimoto et al. [19]. In the experimental data
considered in this study we find a correlation coefficient ranging from
r = 0.79 to r = 0.87 (see Table I) for the three different organisms analyzed.
In our opinion, a good correlation with experimental stacking free energies
provides a first important test of reliability of the analysis.

In our model, a second contribution to the background functional is given by
the intensities at the locations that are physical neighbors on the microarray
(Eq. (2)). The neighbors' influence is understood as coming from the fact that
neighboring locations have similar sequences, as a consequence of the design
of Affymetrix microarrays: similar sequences imply similar background
contributions. The local contribution to the background depends on five
parameters which measure the strength of the correlations. As pointed out in
Sec. III B, the magnitudes and signs of these parameters can be understood in
terms of sequence similarities.

We compared the background intensities predicted by the functional presented
in this paper with the experimental data. The latter are spike-in Affymetrix
data [20] in which a few sequences are added in solution at known
concentration. The spike-in data set is used to develop and test algorithms
for Affymetrix microarray data analysis. In particular we considered the data
at zero spike-in concentration, which measure pure background. We used these
data to compare the performance of our algorithm to the other algorithms
MAS5.0, RMA and GCRMA. This comparison is summarized in Table II, showing the
average squared deviation from the logarithm of the intensities at zero
spike-in concentration. The results show that our algorithm and GCRMA perform
much better than both MAS5.0 and RMA. In the tests performed we noticed that
GCRMA follows the experimental background closely, but it may "fail"
substantially for a few probes of a probe set. This can also be seen in the
examples of Fig. 4. These failures lower the performance of GCRMA, compared to
the physical-chemistry algorithm presented here.

In conclusion, the algorithm developed in this paper provides good quality
results for background estimates compared to existing algorithms and provides
an interesting alternative for background subtraction schemes in Affymetrix
Genechips. Even though we have shown that the performance of our background
functional is satisfactory, there is still room for improvement.